Introduction


Digital traces, often defined as “records of activity (trace data) undertaken through an online 
information system (thus, digital)” (Howison, Wiggins, & Crowston, 2011, p. 769) or “behav- 
ioral residue [individuals leave] when they interact online” (Hinds & Joinson, 2018, p. 2), pro- 
vide researchers with new opportunities for studying social and behavioral phenomena. These 
data come from a variety of technical systems, among others, business transaction systems, 
telecommunication networks, websites, social media platforms, smartphone apps, sensors built 
in wearable devices, and smart meters (Stier, Breuer, Siegers, & Thorson, 2019). Their analysis 
is a core part of computational social science (Edelmann, Wolff, Montagne, & Bail, 2020; Lazer 
et al., 2009). The excitement about digital trace data mainly stems from the fine-grained nature 
of the data that potentially allows researchers to observe individual and social behavior as well as 
changes in behavior at high frequencies and in real time. In addition, their measurement is non- 
intrusive; that is, the data collection happens without the observed person having to self-report. 
Removing human cognition and social interactions from the data collection process can miti- 
gate their well-documented negative impacts on the quality of self-reports (e.g., Tourangeau, 
Rips, & Rasinski, 2000). However, the true potential of digital trace data to answer a broad 
range of social science research questions depends on the features of the specific type of data, 
that is, how they were collected and from whom. The original definition of digital trace data 
is limited to data that are found, that is, data created as a by-product of activities not stemming 
from a designed research instrument. We argue that digital traces can and should sometimes be 
collected in a designed way to afford researchers control over the data generating process and to 
expand the range of research questions that can be answered with these data. 

Readers of this chapter will quickly realize that the use of digital trace data is in its early 
stages. We share the enthusiasm of many researchers to explore the capabilities of digital trace 
data, and enhance and systematize their collection. However, there is more research needed to 
tackle problems of privacy, quality assurance, and a good understanding of break-downs in the 
measurement process. 

In this chapter, we introduce digital trace data and their use in the computational social
sciences (“Use of Digital Trace Data to Study Social and Behavioral Phenomena”). In order to
successfully use digital trace data, research goals need to be aligned with the available data, and
researchers need to recognize that not all data are suitable to answer the most relevant questions
(“What Is the Research Goal?”). Data quality also needs to be evaluated with the research goal
in mind (“Quality Assessment — Quality Enhancement”). To ensure reproducibility and rep- 
licability, documentation of digital trace data collection and processing is necessary, and creat- 
ing sufficient transparency might be even harder than it already is with traditional data sources 
(“Transparency and Reporting Needs”). 


Use of digital trace data to study social and behavioral phenomena 


Digital trace data allow researchers to study a variety of social and behavioral phenomena and 
can be organized in a number of ways. Given the fast-paced development of digital technology 
and the concomitant emergence of novel forms of digital trace data, a mere taxonomy of the 
types of digital trace data (e.g., social media data, Internet search data, geolocation data from 
smartphones) might become outdated quickly. Instead, we organize this section along dimen- 
sions of their use in computational social science (i.e., what type of phenomenon is studied?) 
and the type of observation used when collecting the data (i.e., how obtrusive is the observa- 
tion?). This broader perspective might help social researchers to detect new sources of digital 
trace data and assess properties of digital trace data that already exist and data sources that will 
emerge in the future. 


Type of phenomenon to be studied 


We use two dimensions to describe the types of phenomena that can be studied with digital 
trace data (see Figure 7.1). First, we differentiate between phenomena that pertain to individual 
behavior and those that represent social interactions involving multiple individuals. Second, we 
distinguish between digital and analog phenomena, building on and extending the classifica- 
tion of mobile sensing data by Harari, Müller, Aung, and Rentfrow (2017). Digital phenom- 
ena are types of behaviors and interactions that happen while using a digital device, such as 
browsing the Internet, posting a comment on a social media platform, or making a video call. 
These behaviors and interactions are inherently digital, as they could not happen without the 
use of digital technology. Analog phenomena are behaviors and social interactions that people 
encounter in their everyday lives and that existed well before the age of digital technology, 
including face-to-face communication, physical activity, mobility, and sleep. While the phe- 
nomena themselves happen without the use of digital technology, the ubiquity of smartphones, 
wearables, sensors, and other digital devices leaves a digital trace about them that researchers 
can leverage. 

The combination of these two dimensions creates four broad categories of phenomena 
that can be measured using digital trace data: digital individual behavior (e.g., browsing the 
Internet, typing a query into an online search engine, using an app), analog individual behav- 
ior (e.g., sleeping, working out, doing chores), digital social interactions (e.g., video calling, 
text messaging), and analog social interactions (e.g., face-to-face conversations). While it 
is helpful to organize phenomena along these four categories, we acknowledge that there 
can be overlap between the groups. In particular, behaviors and interactions that used to be 
primarily analog have become increasingly digital over time. Consider, for example, driving; 
driving is inherently an analog individual behavior that does not necessarily require digital 
technology. However, increasingly, cars rely on a combination of traditional mechanics and 
digital technologies for navigation, safety, and autonomous driving (Horn & Kreuter, 2019), 
changing driving from a primarily analog behavior to a digital behavior in the near future. 

Figure 7.1 Examples of analog and digital behaviors and interactions that can be studied using digital
trace data


Similarly, while individuals have always worked together on projects even without the help of 
digital technologies, collaborative work is increasingly done via platforms, such as Dropbox, 
Google Docs, and GitHub. Other phenomena can be considered as muddling the boundaries 
between individual behavior, social behavior, and social interaction. For example, while post- 
ing something on a social media platform is at first an individual behavior (in particular if the 
site is private and there are no followers), the post might trigger a conversation with other 
users, leading to social interactions. Figure 7.1 plots examples of analog and digital behaviors 
and interactions that can be measured using digital trace data in a two-dimensional space 
along our two dimensions. 

One important piece that we will discuss in “What Is the Research Goal?" in more detail is 
the absence of digital trace data in all of these quadrants for certain people and certain behaviors 
through selective use of digital devices. It is very easy to get blindsided by the vast amount of 
data available and to overlook what is not there. Results of research projects can be easily biased. 


Type of observation 


Similar to traditional observational methods in the social sciences, the observation of aforemen- 
tioned individual behaviors and social interactions using digital trace data can be more or less 
obtrusive, depending on how aware the individuals are of the fact that they are being observed 
and that their data are used for research. For example, a form of unobtrusive collection of digital 
trace data happens when private companies utilize technologies such as cookies and browser
fingerprinting to collect information about the browsing behavior of Internet users (Lerner, 
Simpson, Kohno, & Roesner, 2016). Data brokers (e.g., Acxiom, LexisNexis Risk Solutions, 
Experian) provide vast amounts of these data back to interested parties, mainly with the goal to 
infer user attributes (e.g., sociodemographics, personal and political interest) from the contents 
visited for targeted marketing and political campaigning (Duhigg, 2012; Kruschinski & Haller, 
2017; Nickerson & Rogers, 2014). 

As a consequence of the introduction of the EU General Data Protection Regulation 
(GDPR) in May 2018, website providers have started asking Internet users to agree to the terms 
and conditions and accept cookies upon entering their website. However, only a small fraction 
of users seem to be actively reading and understanding what information they are agreeing to 
share with the website and third parties (Obar & Oeldorf-Hirsch, 2020). While in some con- 
texts, for example, when using an online shop such as Amazon, users might expect their data 
to be used for various purposes, in other cases, for example, when scientists and researchers use 
the platform ResearchGate to share papers, users might be surprised about the amount of data
that are collected about them and with whom they are shared.

Digital trace data can also be collected in an unobtrusive manner from social media plat- 
forms where users post comments, share content, and interact with each other. These data can 
usually be scraped or accessed via an application programming interface (API). While APIs are 
usually not primarily designed for a research purpose but for software systems to communicate 
with each other, social scientists have explored the use of, for example, Twitter data to study 
political communication (Jungherr, 2015); Reddit data to measure strength of attitudes on 
politics, immigration, gay rights, and climate change (Amaya, Bach, Keusch, & Kreuter, 2020); 
and Facebook data to study friendship networks (Cheng, Adamic, Kleinberg, & Leskovec, 
2016; Ugander, Karrer, Backstrom, & Marlow, 2011). While posts, for example, on Twitter are 
public by default, only few users are aware that their tweets are used by researchers (Fiesler & 
Proferes, 2018). 

Several other forms of unobtrusive digital trace data have been used to study behavior and 
other social phenomena, for example: 


• Researchers have used aggregated data from online search engines and the queries users
post there to study consumer trends (Vosen & Schmidt, 2011), disease outbreaks such as
influenza (Ginsberg et al., 2009), economic crises (Jun, Yoo, & Choi, 2018), political
polarization (Flaxman, Goel, & Rao, 2016), and migration (Böhme, Gröger, & Stöhr, 2020;
Vicéns-Feliberty & Ricketts, 2016).

• Blumenstock, Cadamuro, and On (2015) used anonymized mobile phone metadata from
cellular network operators to predict poverty and wealth in Africa.

• Göbel and Munzert (2018) studied how German politicians enhance and change their
appearance over time based on traces of changes to biographies on the online encyclopedia
Wikipedia.

• Edelman, Luca, and Svirsky (2017) used Airbnb postings to understand racial discrimination.

• The Billion Prices Project scrapes online prices to measure consumption and inflation across
countries (Cavallo & Rigobon, 2016).

• Przepiorka, Norbutas, and Corten (2017) studied reputation formation in a cryptomarket
for illegal drugs using price and buyers’ ratings data of finished transactions.

• Philpot, Liebst, Levine, Bernasco, and Lindegaard (2020) analyzed bystander behavior, that
is, whether and how individuals intervene during an emergency when in the presence of
others or alone, using footage from closed-circuit television (CCTV) in public spaces.

• Social epidemiologists increasingly use electronic health record data to study, for example,
the impact of the built and social environment (e.g., poverty rates in certain geographic
areas) on health outcomes (Adler, Glymour, & Fielding, 2016).

• Several large-scale projects have deployed connected environmental sensors (Internet of
Things [IoT]), measuring, for example, temperature, humidity, air quality, noise levels, and
traffic volume, in so-called “smart cities,” allowing researchers access to urban measurements
with greater spatial and temporal resolution (Benedict, Wayland, & Hagler, 2017;
Catlett, Beckman, Sankaran, & Galvin, 2017; Di Sabatino, Buccolieri, & Kumar, 2018;
English, Zhao, Brown, Catlett, & Cagney, 2020).


In contrast, some collection of digital trace data is much more obtrusive in that the individu- 
als who produce the data are made explicitly aware of the fact that their data are used for 
research purposes. That is, they have to consent to the data collection and install a designated 
research app to their smartphone, download a meter and install it as a plugin to their Internet 
browser, or wear a sensor on their body. Smartphones in particular have become popular data 
collection tools among social and behavioral scientists (Harari et al., 2016; Link et al., 2014; 
Raento, Oulasvirta, & Eagle, 2009), because many users carry their phones around with 
them throughout the day, allowing for real-time, in situ data collection using the growing 
number of sensors built into these devices (see Figure 7.2). Using designated research apps, 
researchers can get access to log files that are automatically generated by a device's operating 
system, enabling the collection of information about the usage of the device for tasks like 
texting, making and receiving phone calls, browsing the Internet, and using other apps on 
smartphones (i.e., digital behaviors and interactions). These data allow researchers to study, 
among others, social interactions (e.g., Keusch, Bähr, Haas, Kreuter, & Trappmann, 2020c),
and even infer personality based on how users interact with the smartphone and what apps
they use (e.g., Stachl et al., 2020). The native sensors built into smartphones and other wearable
devices enable the measurement of users’ current situation and their behavior outside
of the generic functions of the phone, where the device is merely present in a given context 
(i.e., analog interactions and behaviors). For example, researchers have collected informa- 
tion about smartphone users' location and movements via global navigation satellite systems 
(GNSS), Wi-Fi, and cellular positioning, proximity to others using Bluetooth, and physical 
activity through accelerometer data. In addition, a combination of sensors (e.g., microphone, 
light sensor, accelerometer) can be used to capture information about the smartphone’s and,
by extension, the participant’s ambient environment, inferring frequency and duration of
conversation and sleep (e.g., Wang et al., 2014), as well as levels of psychological stress
(Adams et al., 2014). 

To provide context to the passively collected sensor and log data, researchers often administer 
in-app survey questions that inquire about phenomena such as subjective states (e.g., mood, 
attitudes) that require self-report (Conrad & Keusch, 2018). This combined approach of self- 
report and passive measurement on smartphones has been used to study, among others, mobility 
patterns (Elevelt, Lugtig, & Toepoel, 2019; Lynch, Dumont, Greene, & Ehrlich, 2019; Scher- 
penzeel, 2017), the influence of physical surroundings and activity on psychological well-being 
and health (Goodspeed et al., 2018; Lathia, Sandstrom, Mascolo, & Rentfrow, 2017; MacKer- 
ron & Mourato, 2013; York Cornwell & Cagney, 2017), student well-being over the course of 
an academic term (Ben-Zeev, Scherer, Wang, Xie, & Campbell, 2015; Harari, Gosling et al., 
2017; Wang et al., 2014), integration efforts of refugees (Keusch et al., 2019), job search of men 
recently released from prison (Sugie, 2018; Sugie & Lens, 2017), the effects of unemployment 
on daily life (Kreuter, Haas, Keusch, Bähr, & Trappmann, 2020), and how students interact with
each other across a variety of communication channels (Sapiezynski, Stopczynski, Lassen, &
Lehmann, 2019; Stopczynski et al., 2014). 

Digital traces can also be collected from wearable devices such as wrist- or waist-worn 
trackers that measure physical activity and, depending on the device type, additional infor- 
mation such as heart rate and geolocation. Some studies have recruited existing users of 
consumer-grade fitness trackers (e.g., Fitbit, Garmin) and smartwatches (e.g., Apple Watch) 
to share their data with researchers (Ajana, 2018). For example, over 500,000 German vol- 
unteers donated data collected from their fitness wristbands and smartwatches to the Robert 
Koch Institute (RKI) during the COVID-19 pandemic. Some studies have used platforms
such as Fitabase to get access to the wearable data of people recruited into their studies (Phillips
& Johnson, 2017; Stück, Hallgrímsson, Ver Steeg, Epasto, & Foschini, 2017). In population
studies, another option is to equip all participants with the same research-grade wearable
device (e.g., Actigraph, Geneactive) and then collect the devices at the end of the field period 
(Harris, Owen, Victor, Adams, & Cook, 2009; Kapteyn et al., 2018; Troiano et al., 2008). 

Another approach to obtrusive digital trace data collection is the use of online tracking
applications (“meters”) that users need to actively install in their Internet browsers and/or
on their mobile devices to allow the collection of browsing histories and app usage. This approach
allows the researcher to trace individual online behavior, for example, news consumption via 
social media websites (Scharkow, Mangold, Stier, & Breuer, 2020), across time. Linking behav- 
ioral meter data with self-reports from web surveys allows researchers to study, for example, 
the relationship between passively measured online news consumption and self-reported voting 
behavior (Bach et al., 2019; Guess et al., 2020) or online news consumption and political
interest (Möller, van de Velde, Merten, & Puschmann, 2019).

Depending on the design of the study, the measurement might be perceived as being less 
obtrusive over time because participants forget or get used to the presence of the measurement 
device. If the only active task for the participant is, for example, to download a research app 
or a meter that collects data in the background on their smartphone or Internet browser, then 
participants might soon forget that their behavior is even being observed. However, wear- 
ing a research-grade device on the body will potentially serve as a constant reminder that the 
individual is part of a study. Similarly, if the study design involves a combination of passive 
measurement of digital traces and repeated collection of self-reports (e.g., ecological momen- 
tary assessment [EMA] questions multiple times a day), it will probably make participants more 
aware of the observational part of the study. 


What is the research goal? 


Given that digital trace data are often collected incidentally and are reused for research purposes, 
it helps to take a step back and examine the goal of the research. Different research questions 
place different requirements on data and might create the need to go above and beyond readily 
available (digital trace) data. To simplify the discussion, we differentiate between three very gen- 
eral research goals irrespective of the data types: description, causation, and prediction. Of course, 
any given research project might combine several of these aspects or include variants not spelled 
out here in detail. This is not the first time we have discussed these issues. Readers interested in 
our presentations in other contexts can refer to Foster, Ghani, Jarmin, Kreuter, and Lane (2020) 
for general big data methods and privacy topics and Kohler, Kreuter, and Stuart (2019) for more 
detailed thoughts on causality and prediction. 

When social scientists aim at describing the state of the society or a special population 
within the society, they typically seek to report a mean, a median, or a graphical distribution of 
a variable of interest. A first decision has to be made when interpreting the descriptive statistic.
Researchers need to be clear if their aim is to describe a population or only report on the data 
at hand. In the case of a census data collection, where by definition all units of the population 
are covered, the two aspects overlap. In all other cases, an extra step is needed, which is often 
difficult when dealing with digital trace data. Say, for example, a researcher is scraping job post- 
ings in Germany in the first week of May in a given year. She can then describe the percentage 
of data scientists sought in her scraped set of posts and only in those. Such restrictions need to 
be communicated clearly when presenting and publishing the results. A much harder task is to 
estimate the percentage of data scientists searched for by all German companies in that year with 
such data at hand. 

When data are not available for the entire population but accessed via samples, the goal of 
inferring to the population is solved by taking a sample with known selection probabilities and 
ensuring that everybody from the population of interest has a positive selection probability. Doing 
so requires a sampling frame that ideally covers the entire population. In the scraping example, 
this could be achieved by having a complete list of companies and being able to acquire all the 
job postings of a sample of companies selected from such a frame. In such a setting, standard 
errors would be used to express uncertainty due to the sampling procedure. When totals are 
reported (i.e., the absolute number of postings for data scientists), getting the selection prob- 
abilities right is particularly important (Lohr, 2009). In practice, one will often face a situation 
in which not all elements in the sample (companies) post all the data scientist positions, or, if the 
method of data collection is a survey, they refuse to respond to the survey request. The survey 
methodology literature has decades of publications on this topic and suggestions for adjustments 
for situations in which the mechanism leading to the missing values is well understood (see, for 
example, Bethlehem, Cobben, & Schouten, 2011; Groves & Couper, 1998; Schnell, 1997; Val- 
liant, Dever, & Kreuter, 2018; Willimack, Nichols, Elizabeth, & Sudman, 2002). Starting with 
a probability sample has the strong advantage that sampling errors can be estimated; nonresponse 
error can be adjusted for known covariates; and with sufficient information on the sampling 
frame, the coverage errors are also known. 
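
The role of known selection probabilities can be illustrated with a minimal sketch. Assume a hypothetical probability sample of four companies drawn from a complete frame, each with a known inclusion probability. The Horvitz-Thompson estimator then weights each sampled company's postings by the inverse of its inclusion probability to estimate the population total. The function name, the counts, and the probabilities below are illustrative assumptions and do not come from any study cited in this chapter.

```python
def horvitz_thompson_total(values, inclusion_probs):
    """Estimate a population total from a probability sample.

    Each sampled value is weighted by 1 / (its inclusion probability).
    """
    if len(values) != len(inclusion_probs):
        raise ValueError("values and inclusion_probs must have equal length")
    for p in inclusion_probs:
        # A unit with zero selection probability can never be represented.
        if not 0 < p <= 1:
            raise ValueError("inclusion probabilities must lie in (0, 1]")
    return sum(y / p for y, p in zip(values, inclusion_probs))


# Hypothetical sample: number of data scientist postings per sampled company.
postings = [4, 0, 2, 10]
# Hypothetical known inclusion probabilities (the last company is sampled with certainty).
probs = [0.5, 0.25, 0.25, 1.0]

estimated_total = horvitz_thompson_total(postings, probs)
naive_total = sum(postings)  # ignores the sampling design

print(estimated_total, naive_total)
```

The gap between the weighted estimate and the naive sum shows why getting the selection probabilities right matters most when totals are reported. The sketch deliberately ignores nonresponse and coverage error, which would require additional adjustments.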

Of course, even if sampling and nonresponse error are adjusted for, assumptions about the 
measurement process still have to be made. Mislabeling might occur, for example, if a job is clas- 
sified as “data scientist” even if the activities do not match the label (false positive) or, conversely, 
if a job entails what is commonly understood as data science but is not explicitly labeled as such 
in the ad (false negative). 
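
How such mislabeling can distort a descriptive estimate may be sketched as follows. The example assumes a small, hypothetical validation set of job ads for which both the label assigned from the ad text and a manually verified "true" label are available. All values and the function name are invented for illustration.

```python
def label_error_summary(true_labels, assigned_labels):
    """Count false positives/negatives and compare estimated vs. true share."""
    if len(true_labels) != len(assigned_labels):
        raise ValueError("both label lists must have equal length")
    if not true_labels:
        raise ValueError("at least one labeled ad is required")
    pairs = list(zip(true_labels, assigned_labels))
    # False positive: labeled "data scientist" although the job is not one.
    false_pos = sum(1 for t, a in pairs if a and not t)
    # False negative: a data science job that is not labeled as such.
    false_neg = sum(1 for t, a in pairs if t and not a)
    n = len(pairs)
    return {
        "false_positives": false_pos,
        "false_negatives": false_neg,
        "estimated_share": sum(assigned_labels) / n,
        "true_share": sum(true_labels) / n,
    }


# Hypothetical validation set of eight ads (True = data scientist job).
true_labels = [True, True, False, False, False, True, False, False]
assigned_labels = [True, False, True, True, False, True, False, False]

summary = label_error_summary(true_labels, assigned_labels)
print(summary)
```

When false positives and false negatives do not cancel out, the share computed from the assigned labels differs from the true share, even if sampling and nonresponse are handled perfectly.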

The issue of overinterpreting the results is not new to digital trace data. We saw and still see 
this happening in the context of traditional data collections via (sample) surveys. The classic 
example where a data generating process was not understood or ignored is the Literary Digest 
poll, which incorrectly called the 1936 election (Squire, 1988). The Literary Digest went for 
volume and overlooked issues of selective access to phones and magazine subscriptions when 
assembling its mailing lists. Many of the data collection efforts in the COVID-19 pandemic 
show a similar tendency (see Kohler, 2020 and the associated special issue). Likewise, using 
Twitter data as a source to identify areas in need of support after natural disasters (e.g.,
hurricanes) may misguide policy makers. Resources and attention would likely flow towards the
younger population, people with easy Internet access, or those generally well connected (Shel- 
ton, Poorthuis, Graham, & Zook, 2014). 

When the main research goal is the establishment of a causal relationship, the situation is a bit 
different (Kohler et al., 2019). If a treatment is applied with a proper randomized experiment 
or a strong non-experimental study design, then statements about such causal relationship can 
be made for anyone who had a chance to be treated. Knowing the selection probability of the 
cases is then much less important, though it is very important that all elements have a positive
probability of being assigned to the treatment and control conditions (Imbens &
Rubin, 2015). 

One example of an experiment done in a controlled fashion with digital trace data as the 
outcome is the Facebook emotional contagion study (Kramer, Guillory, & Hancock, 2014), 
where the number and type of posts seen on the users’ wall were manipulated for a random 
sample of Facebook users. Differences in posting behavior (i.e., number of posts, sentiment) 
between users who were exposed to the treatment and those who were not can be interpreted 
as the causal effect of the treatment. Similarly, in a study with 193 volunteer Japanese smart- 
phone owners who downloaded a research app, a random half of participants received on- 
screen reminders designed to stimulate interaction with communication weak ties during the 
two-month study period. The researchers compared the average number of phone calls, text 
messages, and emails from the smartphone log files to estimate the causal effect of the reminder 
messages (Kobayashi, Boase, Suzuki, & Suzuki, 2015). 
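
The logic of comparing randomized groups can be sketched as a simple difference in means with a standard error that allows for unequal variances. The outcome values below are invented for illustration (for example, weekly counts of logged calls per participant) and are not taken from the studies cited above. The function name is likewise an assumption.

```python
import math
from statistics import mean, variance


def difference_in_means(treated, control):
    """Estimate the average treatment effect and its standard error.

    Assumes units were randomly assigned to the two groups.
    """
    if len(treated) < 2 or len(control) < 2:
        raise ValueError("each group needs at least two observations")
    effect = mean(treated) - mean(control)
    # Unequal-variance (Welch-type) standard error of the difference.
    se = math.sqrt(variance(treated) / len(treated)
                   + variance(control) / len(control))
    return effect, se


# Hypothetical outcomes from device log files.
treated = [5, 7, 6, 8]   # received on-screen reminders
control = [4, 5, 3, 4]   # received no reminders

effect, se = difference_in_means(treated, control)
print(round(effect, 2), round(se, 2))
```

Under randomization, the difference in group means is an unbiased estimate of the average effect among the participants. Whether it carries over to people outside the study is a separate question of external validity.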

Interesting causal claims can also be made in quasi-experimental settings where external 
shocks create the treatments, and regression discontinuity or similar designs can be used. During 
the COVID-19 pandemic, digital trace data from mobile devices were used to assess the (causal) 
effects of lock-down restrictions or other interventions designed to slow the spread of the virus. 
However, in the digital trace data setting — just as with traditional data collection — detailed 
knowledge about what is measured always needs to be available to interpret a "treatment"
correctly. The Google mobility data in Figure 7.3 show how difficult it can be to differentiate signal
from noise and how much pre-processing and data cleaning is still necessary.

Figure 7.3 Google mobility data. Google points out “March 12 was a widely celebrated public holiday
in this region. Workplace and residential changes are a little different from the community’s
response to COVID-19 but give an idea of the scale of the change. You'll need to apply
your local knowledge, but holidays provide a very specific point of comparison” (Adapted
from https://support.google.com/covid19-mobility/answer/9825414?hl=en-GB Accessed:
7/21/2020)

While these examples have internal validity (albeit to different degrees), they lack external
validity: inference to the population at large is not possible without further assumptions.
In the Google mobility data example, not all units in the population have mobile
devices that feed into this analysis. Thus, any causal claim made from the data implicitly generalizes to
the population without mobile devices. In general, should the research question involve a causal
claim or causal inference about a larger population than what is covered by the data, then the
same issues arise as in the descriptive setting described earlier. The causal relationship will only
hold if the causal effect is the same for the people that had the chance to be randomized to the
treatment and those that did not (effect homogeneity) or, in the case of the Google mobility
data, those for whom measures can be obtained and those for whom they cannot.

Not being able to randomize into treatment and control on a random sample of the 
population is a known problem in medical research. There the assumption of effect homo- 
geneity has often been made. More recently, medical and public health researchers have 
increasingly scrutinized these assumptions and created statistical methods that help to gen- 
eralize causal effects to the general population (DuGoff, Schuler, & Stuart, 2014). We 
would not be surprised if similar efforts will take place for digital trace data, with research- 
ers thinking hard about the behavior of people not contributing to datasets (at all or at the 
same rate). 
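
Why effect homogeneity matters can be shown with a simple post-stratification sketch. It is a deliberately simplified illustration, not the specific methods developed in the literature cited above. It assumes that effects can be estimated separately within two hypothetical strata, that effects are homogeneous within each stratum, and that the population shares of the strata are known. All numbers and names are invented.

```python
def weighted_average_effect(stratum_effects, shares):
    """Average stratum-specific effects using the given stratum shares."""
    if set(stratum_effects) != set(shares):
        raise ValueError("effects and shares must cover the same strata")
    if abs(sum(shares.values()) - 1.0) > 1e-9:
        raise ValueError("shares must sum to 1")
    return sum(stratum_effects[s] * shares[s] for s in stratum_effects)


# Hypothetical stratum-specific treatment effects estimated from the data.
effects = {"younger": 4.0, "older": 1.0}
# Hypothetical composition of the data at hand vs. the target population.
sample_shares = {"younger": 0.8, "older": 0.2}
population_shares = {"younger": 0.5, "older": 0.5}

sample_effect = weighted_average_effect(effects, sample_shares)
population_effect = weighted_average_effect(effects, population_shares)
print(round(sample_effect, 2), round(population_effect, 2))
```

If the effect were the same in both strata, the two averages would coincide and the composition of the data would not matter. Reweighting of this kind cannot help for groups that contribute no data at all, because no stratum-specific effect can be estimated for them.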

Prediction tasks are common in a data science pipeline when dealing with digital trace data. 
While social scientists are usually more interested in description of a population or causal effects, 
in doing either, they use predictions when specific measurements cannot be designed for the 
covariates of interest. Examples are Fitbit converting accelerometer sensor data into steps, using 
algorithms that differentiate between different motions, settings, and movements; predicting
voting behavior based on online news consumption (Bach et al., 2019); predicting personality 
traits based on smartphone usage (Stachl et al., 2020); or predicting the level of gentrification 
in a neighborhood based on data about business activities from Yelp (Glaeser, Kim, & Luca,
2018). The reason prediction models are popular for such tasks is that they “often do not 
require specific prior knowledge about the functional form of the relationship under study and 
are able to adapt to complex non-linear and non-additive interrelations between the outcome 
and its predictors while focusing specifically on prediction performance” (Kern, Klausch, & 
Kreuter, 2019, p. 73). 

Prediction tasks can work very well when large amounts of data are available, ideally for the 
exact same situation, person, or setting, and the prediction is made in close temporal proximity 
to the observation. However, the further the predicted outcome is from the data the prediction 
is based on, both temporally (e.g., predicting a “like” or “click” three months down the road) 
and conceptually (e.g., predicting an election outcome) or both (e.g., predicting an election 
outcome three months down the road), the lower the prediction success. 

The potential for (massive amounts of) digital trace data to be used in prediction tasks is
undeniable due to their unprecedented scope and variety. However, knowing who is covered
by the data, for which settings, and which circumstances or time frames is just as important 
here as it is in the description and causal inference setting. Without knowing the ins and outs 
of the data generating process, there are real risks of biases due to unknown or unobserved 
systematic selection with respect to a given research question. This will increasingly be an 
issue with automated decision systems used in the societal context. For example, while pre- 
dictive policing can be used to allocate police resources, it could harm society if the data do 
not represent the population at large and predictions are biased (Rodolfa, Saleiro, & Ghani, 
2020). 


Quality assessment - quality enhancement 


The quality of digital trace data is relative to the research goal. Or, to put it differently, without 
knowing the research goal, it is possible only in very specific circumstances to make an overall 
claim about the quality of the data or to assess the fit of the data for a given research goal. Several 
frameworks exist that can help characterize data quality (see summaries in Christen, 2012; 
National Academies of Sciences, 2017). Typical elements are accuracy, completeness, consistency, 
timeliness, and accessibility. Ultimately, a researcher has to ask herself whether the data can support 
the inference she is planning to make. This situation is no different for digital trace data than for 
any other data source, for example, more traditional survey data (Schnell, Hill, & Esser, 2018). 

Klingwort and Schnell (2020) show the difficulty in using digital trace data related to 
COVID-19 data collection. They convincingly question the use of the volunteer Fitbit app data 
donations in the previously described Robert Koch Institut effort. Not only was the number 
of people installing the app insufficient to cover all the variability in the population (as of early 
May 2020), but it also suffered from sources of nonresponse due to lack of knowledge about 
the app, privacy concerns, willingness to participate, and regular device use, as well as sources of 
coverage error due to owning the appropriate device and having the necessary technical skills 
(see Figure 1 in Klingwort & Schnell, 2020). 

In addition, digital trace data can be subject to quality challenges less common in tradi- 
tional data sources. For example, easily overlooked are problems of de-duplication, with units 
appearing in found or donated digital trace data multiple times, or records representing multiple 
units without being noticed as such. Schober, Pasek, Guggenheim, Lampe, and Conrad (2016) 
describe the former in their assessment of the use of social media data for social research and 
state that individual posts can represent. The latter is a problem that easily appears when devices 
such as computers, tablets, or smartphones are used by multiple people (Hang, von Zezschwitz, 
De Luca, & Hussmann, 2012; Matthews et al., 2016; Silver et al., 2019) and likely occurs more 
often when data are collected via smart devices in households. Another problem in analyzing 
social media data is the presence of bots that would be treated as human posters when comput- 
ing the summary statistics. Data from search engines also illustrate novel challenges to data qual- 
ity in digital trace data. Search engines might change in terms of how they are designed, who 
uses them, and how users engage with them over time in ways that are out of the researcher’s 
control (Lazer, Kennedy, King, & Vespignani, 2014). 
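
To make the first of these problems concrete, the following minimal Python sketch removes repeated records from a small set of posts and counts how many posts each account contributes. The records, the field layout, and the helper names `normalize`, `deduplicate`, and `posts_per_account` are hypothetical and chosen only for illustration. What counts as a duplicate (here: the same account posting the same text after lowercasing and whitespace normalization) is an assumption that has to be justified for each research question. The sketch does not address records that represent multiple units, such as shared devices, nor does it detect bots.

```python
from collections import defaultdict

# Hypothetical records: (record_id, account_id, text). The layout is an assumption.
posts = [
    ("r1", "acct_a", "Prices are up again"),
    ("r2", "acct_a", "Prices are up again"),   # exact repeat by the same account
    ("r3", "acct_b", "prices are up again "),  # same text, different account
    ("r4", "acct_c", "Lost my job last week"),
]


def normalize(text):
    """Lowercase and collapse whitespace so trivial variants compare equal."""
    return " ".join(text.lower().split())


def deduplicate(records):
    """Keep the first record for each (account, normalized text) pair."""
    seen = set()
    kept = []
    for record_id, account_id, text in records:
        key = (account_id, normalize(text))
        if key not in seen:
            seen.add(key)
            kept.append((record_id, account_id, text))
    return kept


def posts_per_account(records):
    """Count how many records each account contributes."""
    counts = defaultdict(int)
    for _, account_id, _ in records:
        counts[account_id] += 1
    return dict(counts)


unique_posts = deduplicate(posts)
print(len(posts), "records,", len(unique_posts), "after de-duplication")
print(posts_per_account(unique_posts))
```

In this toy example, the repeated post by `acct_a` is dropped, while the identical text from `acct_b` is kept because it comes from a different account.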

In a more abstract way, a design feature is the possibility to gain access to data in an organic, 
found, or ready-made way (Groves, 2011; Japec et al., 2015; Salganik, 2017). While this distinction 
between data found in the wild and data collected by design highlights an important feature, 
the data collection itself (say digital trace vs. survey questions) is independent of the found vs. 
design distinction. 

Designed measurement of digital traces would mean participants are selected into the study 
and a specific technology such as an online meter, a mobile app, or a wearable specifically 
designed for a research study is used for data collection. Found data are byproducts of interac- 
tions with the world that leave digital traces; or, put differently, they arise organically. With 
found data, researchers have no control over who provides the data and how. Typical examples 
of found data are credit card transactions, postings in online search engines, and interactions 
with and on social media. 

In practice, we often see a mix of designed and organic data. Sometimes, when researchers 
collaborate closely with the primary entities that collect the data, they might have the chance 
to provide input into what information is captured. For example, when working closely with 
government agencies, researchers might have some input into how the measurement is taken 
(e.g., fields on a form of digital health records or unemployment insurance notices). Likewise, 
one can select respondents carefully (design the sample), but collect “found/organic” 
data that were not designed for the purpose of the research study but metered through already 
existing devices. An example is the IAB-SMART study (Kreuter et al., 2020), where existing 
measurement instruments in smartphones (accelerometer, pedometer, GPS) are used, but measurements 
are taken at specific intervals or in response to an event, bringing a design element 
into the mix. One major advantage of collecting data through designated research apps is that 
this allows researchers to specifically design all aspects of the data collection process (e.g., field 
period, participants’ characteristics, particular sensors used) with a specific research question 
in mind. The controlled environment, potentially in conjunction with a probability sample, 
allows the researcher not only to assess coverage (Keusch, Bahr, Haas, Kreuter, & Trappmann, 
2020a), nonresponse (Keusch, Bahr, Haas, Kreuter, & Trappmann, 2020b), and measurement 
error (Bahr, Haas, Keusch, Kreuter, & Trappmann, 2020) but also to address these issues 
through weighting techniques known from survey research that would allow inference of the 
results to a larger population. 

Prior to any applied research, it can be useful to ask the following questions: 


• Is the population of interest covered by the data? If not, which groups are missing? Is it even 
known which groups are missing? 

• Can the sample represent the population? If not, are certain units entirely missing, or 
are they just not represented in the proportion needed? Are the reasons they are missing 
known, and can they be measured (in which case weighting might be an option)? 

• Do I know what the measurements represent? Or do I need to generate new features from 
the digital trace data to answer the research question? How accurate are the attribute values 
in the data? Are all variables needed for the analysis in the data? 

• How timely are the data? 

• Are there data available that can be used to assess the quality of the generated features on a 
small scale? Can the small-scale assessment be generalized to the entire data? 
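
The weighting option mentioned in the second question can be illustrated with a minimal post-stratification sketch in Python. All numbers are invented for illustration: the population shares stand in for known benchmarks (e.g., from official statistics), and the sample stands in for app participants with an observed outcome such as daily step counts. The function names `poststratification_weights` and `weighted_mean` are hypothetical. The sketch assumes that group membership is measured for every sampled unit and that, within groups, participants resemble nonparticipants, which is a strong assumption in practice.

```python
from collections import Counter

# Hypothetical population shares by age group (assumed to be known benchmarks).
population_share = {"18-34": 0.30, "35-54": 0.35, "55+": 0.35}

# Hypothetical sample of (age_group, daily_steps); younger people are overrepresented.
sample = [
    ("18-34", 9000), ("18-34", 11000), ("18-34", 10000),
    ("18-34", 8000), ("18-34", 12000), ("18-34", 10000),
    ("35-54", 7000), ("35-54", 8000), ("35-54", 9000),
    ("55+", 5000),
]


def poststratification_weights(groups, target_share):
    """Return one weight per group: population share divided by sample share."""
    counts = Counter(groups)
    n = len(groups)
    missing = set(target_share) - set(counts)
    if missing:
        # Weighting cannot compensate for groups that are absent from the sample.
        raise ValueError(f"groups missing from the sample: {sorted(missing)}")
    return {g: target_share[g] / (counts[g] / n) for g in target_share}


def weighted_mean(records, weights):
    """Weighted mean of the outcome, using the weight of each record's group."""
    total = sum(weights[g] * y for g, y in records)
    return total / sum(weights[g] for g, _ in records)


weights = poststratification_weights([g for g, _ in sample], population_share)
unweighted = sum(y for _, y in sample) / len(sample)
weighted = weighted_mean(sample, weights)
print("unweighted mean:", round(unweighted, 1))
print("weighted mean:", round(weighted, 1))
```

Because younger participants are overrepresented in this invented sample, the weighted mean (7,550) is lower than the unweighted mean (8,900). Weighting of this kind cannot repair groups that are missing entirely, which is why the function raises an error in that case.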


For computational social science to be successful in using digital trace data, we foresee that in 
most (if not all) cases, data from different sources need to be combined, either to overcome the 
problem of unknown populations of inference or to overcome the problem of missing covariates 
and overall unclear measurement properties (Christen, Ranbaduge, & Schnell, 2020; Couper, 
2013; Schnell, 2019). 


Transparency and reporting needs 


As tempting as the use of (easily available) digital trace data is, one has to keep in mind that 
there is a long path between the raw data and insights derived from the data. Because many of 
the digital trace data are by-products of processes with a purpose different from the researcher’s 
intent, many preprocessing steps are needed before the analyses can begin. In a complex, 
fast-moving world, where platforms and processes change, digital traces will by definition be 
inconsistent and noisy (Foster et al., 2020) and be filled with missing data, not least because 
of the different terms and conditions for data access and use that the different platforms impose 
(Amaya, Bach, Keusch, & Kreuter, 2019). 

How sensitive results are to such preprocessing steps and the accompanying decisions was 
demonstrated by Conrad et al. (2018) for studies trying to create alternative indicators for con- 
sumer confidence and consumer sentiment from Twitter data. The volatility of the results raised 
skepticism among the authors and prompted them to call for best practices in generating features 
and documenting results when using Twitter data for these purposes. Such a desire for best 
practices and standards in reporting can be seen in many other communities as well, very 
prominently among statistical agencies around the world. The United Nations Statistics Division⁵ 
lists principles governing international statistical activities, including a call for transparency of 
“concepts, definitions, classifications, sources, methods and procedures employed”. Growing 
adoption of FAIR data principles — Findability, Accessibility, Interoperability, and Reusability — 
by funding agencies, journals, and research organizations further increases the need to acquire 
sufficient information about the data generating process, as well as subsequent steps in preproc- 
essing the data. It is important to realize that FAIR principles also apply to algorithms, tools, 
and workflows generating the analytic dataset, not just to the raw data or, in our case, the raw 
digital traces.⁶ As Wilkinson et al. (2016) state, “all scholarly digital research objects — from data 
to analytical pipelines — benefit from application of these principles, since all components of the 
research process must be available to ensure transparency, reproducibility, and reusability” (p. 1). 
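
One way to document the path from raw traces to the analytic dataset is to record every preprocessing step together with its effect on the data. The following Python sketch is a hypothetical illustration, not a description of any of the cited studies: the function names `fingerprint` and `run_pipeline`, the toy records, and the two filter rules are assumptions. It logs the number of rows and a short content hash after each step, so that the sequence of decisions can be reported and rerun.

```python
import hashlib
import json


def fingerprint(rows):
    """Short, reproducible hash of the current state of the data."""
    payload = json.dumps(rows, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()[:12]


def run_pipeline(rows, steps):
    """Apply named preprocessing steps in order and log the effect of each one."""
    log = [{"step": "raw", "n_rows": len(rows), "sha256_prefix": fingerprint(rows)}]
    for name, func in steps:
        rows = func(rows)
        log.append({"step": name, "n_rows": len(rows), "sha256_prefix": fingerprint(rows)})
    return rows, log


# Hypothetical raw records; the field names are assumptions.
raw = [
    {"user": "u1", "text": "Feeling good about the economy"},
    {"user": "u2", "text": ""},
    {"user": "u3", "text": "RT Feeling good about the economy"},
]

# Each step is a (name, function) pair so that the decision itself is documented.
steps = [
    ("drop_empty_text", lambda rows: [r for r in rows if r["text"].strip()]),
    ("drop_retweets", lambda rows: [r for r in rows if not r["text"].startswith("RT ")]),
]

clean, log = run_pipeline(raw, steps)
for entry in log:
    print(entry)
```

Publishing such a log alongside the code makes it easier for others to see how many records each decision removed and to check whether results are sensitive to a particular step.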

This said, researchers intending to use digital trace data should be aware that even if data 
collection is cheap, there are substantial costs associated with cleaning, curating, standardizing, 
integrating, and using the new types of data (Foster et al., 2020). Novices in using digital trace 
data might benefit from reading Amaya et al. (2019) to get a sense of challenges and opportuni- 
ties when working with digital trace data from platforms, in this case Reddit. While specific 
to Reddit, the paper lists types of information one may seek to acquire prior to conducting a 
project that uses any type of social media data. 


Conclusion 


We are, without a doubt, excited about the possibilities digital trace data provide to social 
science research. The direct and often unobtrusive observation of individual behaviors and 
social interactions through digital systems produces data in breadth and depth that cannot be 
generated using traditional methods. However, for digital trace data to become a mainstream 
data source, there is still a long way to go. The initial hype has leveled off, and more and more 
research papers appear showing the challenges and limits of using digital trace data. At the same 
time, research studies that use clever designs to combine multiple data sources are on the rise. 

The combination of multiple sources is not without risk. Multiple streams of data from 
different sources can create detailed profiles of users’ habits, demographics, or well-being that 
carry the risk of unintentionally re-identifying previously anonymous data providers (Bender, 
Kreuter, Jarmin, & Lane, 2020; Deursen & Mossberger, 2018). We are hopeful that the parallel 
efforts going on right now with respect to privacy preserving record linkage (Christen et al., 
2020) and encrypted computing (Goroff, 2015) will help mitigate those risks. While both areas 
are heavily dominated by computer scientists and statisticians, we encourage social scientists to 
inject themselves into this discussion so that the solutions work not just theoretically but also 
in practice (see Oberski and Kreuter, 2020, for the controversy around the use of differential 
privacy). 

Whether multiple data sources are combined or single sources of digital traces are used, ethi- 
cal challenges arise when the digital traces are the result of organic processes with a different 
original purpose (found data). As Helen Nissenbaum (2018) clearly lays out in her framework of 
contextual integrity, one cannot or should not ignore the question of the appropriateness of data 
flows. Appropriateness is a function of conformity with contextual informational norms. To 
give a brief example: A bouncer at a nightclub might see a woman’s address as he checks her age 
to allow entrance into the club. If he later shows up at her house using the piece of information 
acquired during his job, he violates contextual informational norms. Re-purposing of digital 
trace data can violate contextual informational norms in similar ways. While the unanticipated 
secondary use constitutes the “crown jewels” of passively collected digital trace data (Tene & 
Polonetsky, 2013), users are increasingly concerned about the privacy of their data and how 
much they can control how their personal information is used (Auxier et al., 2019). For the 
researcher, the use creates a challenge in how to balance the risk to the participants with the 
utility of the collected data, the so-called privacy-utility trade-off (Bender et al., 2020). 


Notes 


1 At the time of writing this chapter, ResearchGate asks the users for permission to share their personal 
data (e.g., IP address, cookie identifiers) with almost 500 external partners 
(www.researchgate.net/privacy-policy, June 30, 2020). 

2 https://corona-datenspende.de/science/en/ 

3 https://www.fitabase.com/ 

4 https://help.fitbit.com/articles/en_US/Help_article/1136 

5 https://unstats.un.org/unsd/methods/statorg/Principles_stat_activities/principles_stat_activities.asp 

6 Including code leading to data as well as metadata, data describing the data, has already been part 
of the data management plan requirements of the U.S. National Science Foundation, for example. 
https://nsf.gov/eng/general/ENG_DMP_Policy.pdf 